Guest tracing improvements to use `tracing` crate #844

dblnz · 2025-08-31T13:46:42Z

Description

This PR closes #723, #704 and partially addresses #318.
These changes modify the way we perform guest tracing to use the `tracing` crate and its macros (`instrument`, `trace`).

How it works

Guest

What makes this possible is the implementation of the Subscriber trait in the hyperlight-guest-tracing crate. By implementing it, we can now handle the capturing of spans and events and choose how to store them and when to export them to the host.

The GuestSubscriber type that implements Subscriber keeps an internal TraceState that holds all the needed information.
Whenever a new span is created, entered or exited, a callback on the subscriber is invoked so that we can record it. The same happens for events.

Each time a new span or event is added to the internal state, we check whether the buffer is full and, if so, send its contents to the host for processing.
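The store-then-flush behaviour described above can be sketched as follows. This is a minimal illustration, not the PR's code: `SpanRecord`, `SpanBuffer` and `push` are hypothetical names, and the sketch uses `std::vec::Vec` for brevity, whereas the real guest is `no_std` and uses its own fixed-size storage.

```rust
/// Hypothetical, simplified stand-in for a span captured in the guest.
#[derive(Debug, Clone, PartialEq)]
struct SpanRecord {
    id: u64,
    parent_id: Option<u64>,
    name: &'static str,
}

/// Store with a capacity `N` that is fixed at compile time.
struct SpanBuffer<const N: usize> {
    spans: Vec<SpanRecord>,
}

impl<const N: usize> SpanBuffer<N> {
    fn new() -> Self {
        Self { spans: Vec::with_capacity(N) }
    }

    /// Stores a span. When the buffer reaches capacity, the full batch is
    /// returned (this models handing the data to the host) and the buffer
    /// starts over empty. Otherwise `None` is returned.
    fn push(&mut self, span: SpanRecord) -> Option<Vec<SpanRecord>> {
        self.spans.push(span);
        if self.spans.len() >= N {
            Some(std::mem::replace(&mut self.spans, Vec::with_capacity(N)))
        } else {
            None
        }
    }
}
```

In this model the caller decides what to do with the returned batch; in the PR the equivalent step is the VM exit that lets the host read the buffer.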

Host

When the host detects a VM exit from the guest, it checks whether it contains tracing information in the OutB instruction.
When tracing information is found, the host goes through it and checks it against the local storage of spans.

If the spans have previously been created, just update the end timestamp (if present) and add new events (if any).
If they haven't been created, create and store them.

The spans' parents are set based on the information received from the host.
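The create-or-update step on the host can be sketched with a map keyed by span id. All names here (`HostSpan`, `GuestSpanUpdate`, `apply_update`) and the field layout are assumptions made for illustration; the PR builds real `opentelemetry` spans instead of plain structs.

```rust
use std::collections::HashMap;

/// Hypothetical host-side view of a guest span.
#[derive(Debug)]
struct HostSpan {
    parent_id: Option<u64>,
    start_tsc: u64,
    end_tsc: Option<u64>,
    events: Vec<String>,
}

/// Hypothetical record parsed from the guest's trace buffer.
struct GuestSpanUpdate {
    id: u64,
    parent_id: Option<u64>,
    start_tsc: u64,
    end_tsc: Option<u64>,
    events: Vec<String>,
}

/// Creates the span if it is not known yet; otherwise updates the end
/// timestamp (if present) and appends any new events.
fn apply_update(store: &mut HashMap<u64, HostSpan>, update: GuestSpanUpdate) {
    let GuestSpanUpdate { id, parent_id, start_tsc, end_tsc, events } = update;
    let span = store.entry(id).or_insert(HostSpan {
        parent_id,
        start_tsc,
        end_tsc: None,
        events: Vec::new(),
    });
    if end_tsc.is_some() {
        span.end_tsc = end_tsc;
    }
    span.events.extend(events);
}
```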

TODO

- The current issue is with guest calls that end up calling back into the host. These do not correctly set the parents of the spans created in the host to the last span created in the guest before the VM exit.
- I need to find a way to propagate the context into the guest and back whenever it is needed. However, using the OpenTelemetry propagators requires std support, which we do not have in the guest.
- There is a corner case when calculating the timestamp for a span that is still open when a VM exit happens.

Jaeger picture of a Guest call that calls back into the host

jprendes

I made a high level review, and what I've seen looks good :-)

src/hyperlight_host/src/hypervisor/hyperv_windows.rs

src/hyperlight_host/src/hypervisor/hyperv_linux.rs

Cargo.toml

src/hyperlight_common/src/outb.rs

src/hyperlight_host/src/hypervisor/hyperv_linux.rs

src/hyperlight_host/src/hypervisor/hyperv_windows.rs

src/hyperlight_host/src/hypervisor/kvm.rs

ludfjig

First round review looks good to me. I am curious what the performance looks like with tracing vs without, maybe we could add a benchmark or something for this?

Also, is there any possibility that we don't flush spans/records after exiting the guest, and that some end up not being emitted?

Another thing to consider is log crate vs tracing crate. Should we ditch one? Or is there some mechanism that allows regular logs to be consumed by the tracing crate? And should we expose any of these in the guest C API?

Also, are the tracing buffer sizes configurable? Maybe they should be if they aren't, so users can tweak them in case they affect performance.

src/hyperlight_common/Cargo.toml

src/hyperlight_guest/src/exit.rs

src/hyperlight_guest/src/guest_handle/host_comm.rs

src/hyperlight_guest_tracing/src/invariant_tsc.rs

jsturtevant · 2025-09-11T23:17:46Z

> I am curious what the performance looks like with tracing vs without, maybe we could add a benchmark or something for this?

+1

dblnz · 2025-09-12T15:27:26Z

> First round review looks good to me. I am curious what the performance looks like with tracing vs without, maybe we could add a benchmark or something for this?

Ok, I can do that.

> Also, is there any possibility that we don't flush spans/records after exiting the guest, and that some end up not being emitted?

Hmm, in my limited testing I haven't seen this case, but I wouldn't exclude the possibility.
One scenario that comes to mind is cancellation, where the vCPU is forced to stop, so there is no out instruction to deliver the info.
Any ideas how we can treat these scenarios?

> Another thing to consider is log crate vs tracing crate. Should we ditch one? Or is there some mechanism that allows regular logs to be consumed by the tracing crate? And should we expose any of these in the guest C API?

I am not sure what the best approach is.
The solution I added for guest spans/events only works with opentelemetry Subscribers on the host, so if there are no opentelemetry subscribers, they won't be captured (I haven't tried it yet, but this is how it should work).

> Also, are the tracing buffer sizes configurable? Maybe they should be if they aren't, so users can tweak them in case they affect performance.

The tracing buffer size is configurable at compile time, which I agree is not ideal for customers.
But I needed a fixed-size buffer so that I could reliably give the host a pointer into guest memory. This is the current approach; I don't know if it is the best, but it is certainly faster than using the input/output buffer, which copies data twice. This way we only copy once, on the host.

dblnz · 2025-09-30T15:57:55Z

> First round review looks good to me. I am curious what the performance looks like with tracing vs without, maybe we could add a benchmark or something for this?

I've run some benchmarks locally and here are the results.
Relative to the small amount of work done in the dummy guest, the tracing overhead may look large, but we need a realistic guest scenario to correctly assess how relevant the numbers are.

| Runtime | Strategy | Flavour | RPS | p50 (s) | p95 (s) | p99 (s) | p99.99 (s) | Peak RSS (MB) |
|---|---|---|---|---|---|---|---|---|
| hyperlight-dummy | new | | 69.96 | 0.7065 | 1.1569 | 1.2509 | 1.3412 | 12.79 |
| hyperlight-dummy-tracing | new | nolog | 70.92 | 0.6728 | 1.1247 | 1.2065 | 1.3029 | 20.21 |
| hyperlight-dummy-tracing | new | log | 70.14 | 0.7354 | 1.0132 | 1.0953 | 1.1922 | 18.54 |
| hyperlight-dummy | reload | | 93506.17 | 0.0004 | 0.0011 | 0.0018 | 0.0067 | 15.00 |
| hyperlight-dummy-tracing | reload | nolog | 56152.01 | 0.0006 | 0.0023 | 0.0049 | 0.0575 | 23.42 |
| hyperlight-dummy-tracing | reload | log | 38585.95 | 0.0009 | 0.0035 | 0.0069 | 0.0507 | 21.90 |
| hyperlight-dummy | reuse | | 80821.05 | 0.0005 | 0.0016 | 0.0026 | 0.0065 | 14.85 |
| hyperlight-dummy-tracing | reuse | nolog | 54745.61 | 0.0007 | 0.0020 | 0.0049 | 0.0581 | 23.28 |
| hyperlight-dummy-tracing | reuse | log | 41301.24 | 0.0009 | 0.0028 | 0.0056 | 0.1097 | 21.24 |

Runtimes:

hyperlight-dummy - Hyperlight Sandbox running a dummy guest (no TracingProvider)
hyperlight-dummy-tracing - Hyperlight Sandbox running a dummy guest with trace_guest enabled on both guest and host and a TracingProvider instantiated
- nolog - the max log level is none, which means the guest doesn't send tracing info, but the Host logic to handle them is enabled
- log - the max log level is trace for the guest, which makes the guest send tracing info and the host to handle them
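To put the RPS column in relative terms, the throughput drop against the non-tracing baseline can be computed as below. The helper name `rps_drop_pct` is made up for this illustration; the inputs are the RPS values reported in the table.

```rust
/// Percentage drop in requests per second relative to a baseline run.
fn rps_drop_pct(baseline_rps: f64, measured_rps: f64) -> f64 {
    (baseline_rps - measured_rps) / baseline_rps * 100.0
}

/// RPS drops for the `reload` and `reuse` strategies, using the table values.
/// Each entry is (strategy, flavour, drop in percent).
fn reported_drops() -> Vec<(&'static str, &'static str, f64)> {
    vec![
        ("reload", "nolog", rps_drop_pct(93506.17, 56152.01)),
        ("reload", "log", rps_drop_pct(93506.17, 38585.95)),
        ("reuse", "nolog", rps_drop_pct(80821.05, 54745.61)),
        ("reuse", "log", rps_drop_pct(80821.05, 41301.24)),
    ]
}
```

With these inputs the drop is roughly 40% (nolog) and 59% (log) for `reload`, and roughly 32% (nolog) and 49% (log) for `reuse`. For the `new` strategy the RPS values are all within about 1.5% of each other, so sandbox creation dominates there.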

Some thoughts:

- This feature started as a development-phase improvement to help figure out where we have bottlenecks in the guest.
- This is a huge improvement over the previous solution we had.
- At this point, I do not think this should be used in production by default because the performance is not ideal. It could be useful on an error path: when something doesn't work as expected, it could be turned on.
- These numbers do not isolate the cost of the tracing logic alone, because other variables are involved, such as the communication with the Jaeger Collector to send the tracing data (when the TracingProvider is installed).
- Some user feedback would be perfect.
- This would benefit from other generic performance improvements, such as sharing memory with the guest in a way that doesn't require copying memory (guest immutable references).
- I think we should first discuss where we stand in terms of these big changes in Hyperlight, which may also positively impact the tracing performance.

ludfjig

This looks good to me in general, but I think maybe we should consider what tunables we need

ludfjig · 2025-10-01T22:18:47Z

src/hyperlight_guest_bin/src/lib.rs

-            );
+            // It is important that all the tracing events are produced after the tracing is initialized.
+            #[cfg(feature = "trace_guest")]
+            if max_log_level == LevelFilter::Trace {


why only init if trace level?

This is to avoid the performance penalty if it is not really needed.
This provides a way to selectively enable it at runtime for development purposes.

I wouldn't say it is yet ready for production, but it definitely is a step in the right direction, starting from what we had (something custom that needs maintenance), to something that uses the tracing and opentelemetry crates.

What if I wanted to turn on tracing for LevelFilter::DEBUG or LevelFilter::INFO

Should this be != LevelFilter::Off https://docs.rs/tracing/latest/tracing/level_filters/struct.LevelFilter.html#impl-LevelFilter

ludfjig · 2025-10-01T22:23:39Z

docs/hyperlight-metrics-logs-and-traces.md

+This custom subscriber stores the spans and events in a buffer initialized only when tracing is enabled. For each new span and event, a method is called on the custom subscriber which
+not only stores the data, but also keeps track of the hierarchy and dependencies between the other spans/events.
+
+When the storage space is filled, the guest triggers a VM Exit that sends the guest pointers to the host. The host can access the guest memory, get the data and parse it to create the `spans` and `events` using the `opentelemetry` crate which allows specifying the starting and ending timestamps


Maybe an option could be to only emit logs during regular VMexits in order to minimize context switches. Or maybe another option could be to discard logs when the buffer is full in order to not incur extra context switches

Yes, this is another option.
I prioritized not losing any info, but we might not care about that. A ringbuffer would do the trick here

yes. If this could be configurable that would be ideal i think

That would be a nice feature, but at this moment, the logic on the host relies on the spans not being lost so that it knows how to arrange them correctly.
Each span contains info about its parent span; if one of those is lost, then there's a gap.

A ringbuffer would work well with logs, but traces, I am not sure.
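The gap problem can be illustrated with a small model of a ring buffer that evicts the oldest span when full. This is a simplified sketch with hypothetical names (`Span`, `SpanRing`, `orphans`), not code from the PR, and it only looks at parents inside the buffer; a real host would also know about parents it received in earlier batches.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical minimal span: an id plus an optional parent id.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    id: u64,
    parent_id: Option<u64>,
}

/// Ring buffer that drops the oldest span when it is full.
struct SpanRing {
    cap: usize,
    spans: VecDeque<Span>,
}

impl SpanRing {
    fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be non-zero");
        Self { cap, spans: VecDeque::with_capacity(cap) }
    }

    /// Adds a span and returns the evicted one, if any.
    fn push(&mut self, span: Span) -> Option<Span> {
        let evicted = if self.spans.len() == self.cap {
            self.spans.pop_front()
        } else {
            None
        };
        self.spans.push_back(span);
        evicted
    }

    /// Ids of spans whose parent is no longer present in the buffer.
    fn orphans(&self) -> Vec<u64> {
        let present: HashSet<u64> = self.spans.iter().map(|s| s.id).collect();
        self.spans
            .iter()
            .filter(|s| s.parent_id.map_or(false, |p| !present.contains(&p)))
            .map(|s| s.id)
            .collect()
    }
}
```

Once the root span is evicted, its child is reported as an orphan, which is the situation the host logic currently cannot arrange correctly.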

- Adds a type that implements the Subscriber trait of the tracing_core crate that allows the type to be set as the global Subscriber of the crate - This way we can handle the adding of new spans and events and store them where/how we want Signed-off-by: Doru Blânzeanu <[email protected]>

- Parse the spans and events coming from the guest and create corresponding spans and events from the host that mimics a single call from host - Create a `TraceContext` that handles a call into a guest Signed-off-by: Doru Blânzeanu <[email protected]>

- conditionally handle logs either through tracing or the dedicated VM exit based on whether tracing is initialized on the guest - modify `test_with_small_stack_and_heap` to 18kB because the `#[instrument]` attributes use more stack. Signed-off-by: Doru Blânzeanu <[email protected]>

dblnz · 2025-10-15T07:53:15Z

> This looks good to me in general, but I think maybe we should consider what tunables we need

I see this version using opentelemetry as an incremental step, after which additional tunables can be added to improve usability.
Unless there is any fundamental issue with this approach, I think this can be merged and improved after.

Improvements:

- Switch to using a circular buffer for traces/events to avoid context switches.
- Provide a mechanism on the guest to enable/disable the tracing support and to configure the memory used and the log level at runtime.

jsturtevant

Really like the idea of using the industry standard solution for tracing!

jsturtevant · 2025-10-16T17:15:03Z

src/hyperlight_guest_bin/src/lib.rs

-            );
+            // It is important that all the tracing events are produced after the tracing is initialized.
+            #[cfg(feature = "trace_guest")]
+            if max_log_level == LevelFilter::Trace {


What if I wanted to turn on tracing for LevelFilter::DEBUG or LevelFilter::INFO

jsturtevant · 2025-10-16T17:16:28Z

src/hyperlight_guest_bin/src/lib.rs

-            );
+            // It is important that all the tracing events are produced after the tracing is initialized.
+            #[cfg(feature = "trace_guest")]
+            if max_log_level == LevelFilter::Trace {


Should this be != LevelFilter::Off https://docs.rs/tracing/latest/tracing/level_filters/struct.LevelFilter.html#impl-LevelFilter
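The difference between the two conditions can be shown with a stand-in enum. `Level` below is a hypothetical type that only mirrors the variant names of `log::LevelFilter`; the two functions are illustrations of the conditions being discussed, not the PR's code.

```rust
/// Stand-in for a level filter (hypothetical; mirrors the variant names
/// of `log::LevelFilter`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Off,
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Condition as written in the diff: tracing is initialized only at `Trace`.
fn init_only_at_trace(max: Level) -> bool {
    max == Level::Trace
}

/// Suggested alternative: tracing is initialized for every level except `Off`.
fn init_unless_off(max: Level) -> bool {
    max != Level::Off
}

/// Levels for which the two conditions disagree.
fn disagreements() -> Vec<Level> {
    [Level::Off, Level::Error, Level::Warn, Level::Info, Level::Debug, Level::Trace]
        .iter()
        .copied()
        .filter(|&l| init_only_at_trace(l) != init_unless_off(l))
        .collect()
}
```

The two checks disagree for `Error`, `Warn`, `Info` and `Debug`: with the equality check a guest configured at any of those levels never initializes tracing.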

jsturtevant · 2025-10-16T17:21:25Z

src/hyperlight_guest_tracing/src/visitor.rs

+        let mut k = heapless::String::<FK>::new();
+        let mut val = heapless::String::<FV>::new();
+        // Shorten key and value if they are bigger than the space allocated
+        let _ = k.push_str(&f.name()[..usize::min(f.name().len(), k.capacity())]);


is there a chance the key gets reduced to something that would overwrite and corrupt data?

I don't see how, it is trimmed to the minimum between its length and the capacity of the buffer, so it shouldn't go beyond the capacity of the container.
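A related point, separate from overflowing the container: in Rust, slicing a `str` by byte index panics if the index does not fall on a UTF-8 character boundary. Field names are usually ASCII identifiers, so this may never trigger for keys, but it can matter if the same truncation is applied to arbitrary values. A minimal sketch of a boundary-safe truncation, with the hypothetical helper name `truncate_to_capacity`:

```rust
/// Returns the longest prefix of `s` that fits in `max_bytes` bytes and
/// ends on a UTF-8 character boundary.
fn truncate_to_capacity(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back until the cut lands on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

The result is never longer than `max_bytes`, so pushing it into a fixed-capacity string of that size stays within capacity.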

src/hyperlight_host/src/hypervisor/arch/mod.rs

jsturtevant · 2025-10-16T17:27:49Z

src/hyperlight_host/src/sandbox/initialized_multi_use.rs

        let mut cfg = SandboxConfiguration::default();
        cfg.set_heap_size(20 * 1024);
-        cfg.set_stack_size(16 * 1024);
+        cfg.set_stack_size(18 * 1024);


is this for the extra space needed with traces?

Yes, I recently discovered that this test failed after I rebased onto the latest main. After some investigation, it turns out this happens because of the traces.

simongdavies

I've made a couple of comments. I don't think any of these need to be addressed here, but we should consider them as future enhancements.

simongdavies · 2025-10-20T10:14:55Z

src/hyperlight_guest_bin/src/lib.rs

            let max_log_level = LevelFilter::iter()
                .nth(max_log_level as usize)
                .expect("Invalid log level");
            init_logger(max_log_level);


Should we only call init_logger if either #[cfg(not(feature = "trace_guest"))] or if this is set as an else to if max_log_level != LevelFilter::Off?

I don't think we should log if tracing is on; we should redirect any log::<level>! macros to tracing (will this just happen by default?). Conversely, if trace_guest is off we should capture any tracing::<level>! macros as logs (see https://docs.rs/tracing/latest/tracing/#emitting-log-records).

Maybe the way we should deal with this is to always emit traces from the guest and get rid of the logging functionality in the guest (which makes the comment above redundant). Likewise, in the host we could always use tracing; that way a consumer would have full control over whether they see logs or traces.

simongdavies · 2025-10-20T10:47:39Z

docs/hyperlight-metrics-logs-and-traces.md

-This will build the guest binaries with the `trace_guest` feature enabled and move them to the appropriate location for use by the host.
+This builds the guest binaries with the `trace_guest` feature enabled and move them to the appropriate location for use by the host.
+
+**NOTE**: To enable the tracing in your application you need to use the `trace_guest` feature on the `hyperlight-guest-bin` and `hyperlight-guest` crates.


I don't think this is strictly true? I can trace the host without this feature.

Ideally it would be nice to get rid of this feature entirely and have the functionality enabled by the presence of a subscriber in the host, but that is maybe something we can think about in the future.

We need a way to capture, store and communicate to the host the captured traces.
By having this custom Subscriber, we have that logic defined by us, so we know the format to expect from the guest.
If we find a way to address this, then we don't need that feature

simongdavies · 2025-10-20T10:57:25Z

src/hyperlight_guest_tracing/src/lib.rs

+};
+
+/// Maximum number of spans that the guest can store
+const MAX_NO_OF_SPANS: usize = 10;


Just wondering if rather than having fixed sizes for these we could just allocate on the guest heap and then have them processed each time we have a natural VM Exit, probably not an issue for this PR, but if we were to unify all the guest logging into traces (see my other comment about this) then this would probably be a better approach.

I agree with this, I would like to expose a way to configure the space used.
So the guest can say what memory it can spare

dblnz requested review from danbugs, devigned, jprendes, ludfjig, marosset, simongdavies and syntactically as code owners August 31, 2025 13:46

dblnz added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label Aug 31, 2025

dblnz changed the title ~~Tracing improvements~~ Guest tracing improvements to use tracing crate Aug 31, 2025

dblnz marked this pull request as draft August 31, 2025 13:48

jprendes reviewed Sep 1, 2025

dblnz force-pushed the tracing-improvements branch 2 times, most recently from d4327a8 to ccaa14e Compare September 10, 2025 22:45

dblnz marked this pull request as ready for review September 10, 2025 22:45

dblnz force-pushed the tracing-improvements branch from ccaa14e to cceb069 Compare September 10, 2025 22:47

ludfjig reviewed Sep 11, 2025

dblnz force-pushed the tracing-improvements branch 6 times, most recently from 77bbba5 to 6d10d2e Compare September 18, 2025 21:27

dblnz force-pushed the tracing-improvements branch from 6d10d2e to 8ff5a1e Compare September 30, 2025 15:25

dblnz force-pushed the tracing-improvements branch from 8ff5a1e to f00d63e Compare October 1, 2025 09:55

ludfjig reviewed Oct 1, 2025

dblnz force-pushed the tracing-improvements branch from f00d63e to 148470a Compare October 6, 2025 15:44

[trace-guest] remove unwind_guest feature

d9e6859

- This feature is not used separate from the mem_profile - All the unwind logic is now gated by mem_profile Signed-off-by: Doru Blânzeanu <[email protected]>

dblnz added 11 commits October 13, 2025 17:50

[trace-host] remove unused OutBAction::TraceRecordStack

85e696a

- The guest side does not use this type of OutBAction - The stack unwinding is done either way when the mem_profile feature is enabled Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-host] remove TraceRegister and use X86_64Regs instead

6bb81bf

Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-host] move trace related logic to separate module

5d74552

- This helps with keeping code separate and easily gating it out Signed-off-by: Doru Blânzeanu <[email protected]>

[trace] remove old tracing functionality

166f72d

- This step cleans up the codebase for the new way of tracing guests - The current method involves custom macros and logic that are not the best for maintainability Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-host] refactor mem_profile logic

270a35b

- Define a separate struct that holds the functionality related to memory profiling of the guest Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-host] move mem_profile logic to different file

98b8caf

- Rename TraceInfo to reflect only being used by mem_profile Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-guest] move invariant_tsc logic to separate module

7404c3e

Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-guest] add span and event storage logic

64292ce

- implement add_span and event methods that store the info and report it to the host when the buffer gets full Signed-off-by: Doru Blânzeanu <[email protected]>

[trace-guest] update outb instructions to include trace info

f291635

Signed-off-by: Doru Blânzeanu <[email protected]>

dblnz force-pushed the tracing-improvements branch from 148470a to 908f482 Compare October 13, 2025 16:00

dblnz added 3 commits October 14, 2025 16:09

[trace-guest] add spans in simpleguest sample

052c2a2

Signed-off-by: Doru Blânzeanu <[email protected]>

update tracing docs

6468526

Signed-off-by: Doru Blânzeanu <[email protected]>

dblnz force-pushed the tracing-improvements branch from 908f482 to 6468526 Compare October 14, 2025 13:29

jsturtevant reviewed Oct 16, 2025

simongdavies approved these changes Oct 20, 2025

